work on adding voyager to evals #959
Conversation
🦋 Changeset detected. Latest commit: 51246f6. The changes in this PR will be included in the next version bump.
Greptile Summary
This PR adds two industry-standard evaluation suites to benchmark Stagehand's web automation capabilities: WebVoyager (643 test cases) and GAIA (90 test cases). The changes significantly expand the evaluation infrastructure to support data-driven benchmarking against established datasets.
The core architectural change introduces a suite-based evaluation system alongside the existing task-based approach. New suite builders (`evals/suites/webvoyager.ts` and `evals/suites/gaia.ts`) read JSONL dataset files and dynamically generate test cases, while corresponding task implementations (`evals/tasks/agent/webvoyager.ts` and `evals/tasks/agent/webarena_gaia.ts`) execute the actual evaluations. The system supports flexible sampling strategies, using a Fisher-Yates shuffle for randomized selection or deterministic first-N selection.
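As a rough illustration of this suite-builder pattern, here is a minimal sketch of reading a JSONL dataset and turning each row into a test case. The row field names, the `webvoyager/` naming scheme, and the returned object shape are assumptions for illustration, not the repo's actual schema:

```ts
import * as fs from "fs";

// Assumed row shape for a WebVoyager JSONL entry; the real schema may differ.
interface WebVoyagerRow {
  id: string;
  web: string; // starting URL
  ques: string; // task instruction
}

// Read a JSONL file: one JSON object per non-empty line.
function readJsonl(filePath: string): WebVoyagerRow[] {
  return fs
    .readFileSync(filePath, "utf-8")
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as WebVoyagerRow);
}

// Each dataset row becomes one dynamically generated test case that the
// runner can execute like any hand-written task.
function buildTestcases(filePath: string) {
  return readJsonl(filePath).map((row) => ({
    name: `webvoyager/${row.id}`,
    taskParams: { startUrl: row.web, question: row.ques },
  }));
}
```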
Key infrastructure improvements include:
- A new `core/summary.ts` module that extracts summary generation logic into a reusable component
- An enhanced type system with optional `taskParams` and `params` fields to pass dataset-specific parameters to evaluation functions (sketched after this list)
- New utility functions for JSONL parsing, data validation, and sampling in `evals/utils.ts`
- Environment variable configuration for controlling test execution (sample sizes, limits, difficulty levels)
- Updated evaluation runner logic in `index.eval.ts` to handle both static tasks and dynamic dataset-driven evaluations
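As referenced above, an illustrative sketch of how the optional parameter fields might thread dataset values through to an eval function. Only the `taskParams` and `params` field names come from this PR; the surrounding interfaces and the runner helper are assumptions:

```ts
// Hypothetical shapes; only `taskParams` and `params` are named in the PR.
interface Testcase {
  name: string;
  // Dataset-specific values attached when a suite builder generates the case.
  taskParams?: Record<string, unknown>;
}

interface EvalFunctionInput {
  modelName: string;
  // The same values, surfaced to the eval function at execution time.
  params?: Record<string, unknown>;
}

// A runner would copy taskParams into the eval function's input.
function toEvalInput(tc: Testcase, modelName: string): EvalFunctionInput {
  return { modelName, params: tc.taskParams };
}
```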
The datasets themselves are substantial additions: WebVoyager contains 643 web navigation tasks across 13+ websites (Amazon, Google services, GitHub, etc.), while GAIA provides 90 general AI assistant tasks with varying difficulty levels. Both datasets start from standardized URLs and expect structured response formats.
This integration maintains full backward compatibility with existing evaluations while providing the foundation for systematic benchmarking against industry standards. The sampling capabilities allow for both development testing (small samples) and comprehensive evaluation runs.
Confidence score: 4/5
- This PR is safe to merge with minimal risk as it maintains backward compatibility and adds well-structured evaluation capabilities
- Score reflects solid implementation patterns and comprehensive infrastructure changes, though there's a potential division-by-zero edge case in summary generation
- Pay close attention to `evals/core/summary.ts` for the division-by-zero issue in category success rate calculation (a possible guard is sketched below)
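For context on that last point, a minimal sketch of the kind of guard a category success-rate calculation would need; the names here are hypothetical, not the actual code in `evals/core/summary.ts`. Note that in JavaScript, dividing by a zero total yields `NaN` (or `Infinity`) rather than throwing, which would silently corrupt the summary output:

```ts
// Hypothetical guard: a category with zero recorded cases yields 0%
// instead of NaN from 0/0.
function categorySuccessRate(passed: number, total: number): number {
  return total === 0 ? 0 : (passed / total) * 100;
}
```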
12 files reviewed, 3 comments
Resolved conflicts by merging agent task configurations and including both taskParams and agent properties in StagehandInitResult interface. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
Add agent evaluation support to CI pipeline
added category to CI for external agent benchmarks
```diff
@@ -39,6 +41,13 @@ for (const arg of rawArgs) {
     }
   } else if (arg.startsWith("provider=")) {
     parsedArgs.provider = arg.split("=")[1]?.toLowerCase();
   } else if (arg.startsWith("--dataset=")) {
     parsedArgs.dataset = arg.split("=")[1]?.toLowerCase();
   } else if (arg.startsWith("max_k=")) {
```
note for later: we should make this arg a bit more intuitive (something along the lines of max number of evals)
Co-authored-by: Miguel <[email protected]>
## Why
Add WebVoyager and GAIA evaluation suites to benchmark Stagehand's web navigation and reasoning capabilities against industry-standard datasets.
## What Changed
`evals.config.json`
### Environment Variables
- `EVAL_WEBVOYAGER_SAMPLE`: Random sample size from the WebVoyager dataset
- `EVAL_WEBVOYAGER_LIMIT`: Max cases to run (default: 25)
- `EVAL_GAIA_SAMPLE`: Random sample size from the GAIA dataset
- `EVAL_GAIA_LIMIT`: Max cases to run (default: 25)
- `EVAL_GAIA_LEVEL`: Filter GAIA by difficulty level (1, 2, or 3)

### Sampling Strategy
The sampling implementation uses a Fisher-Yates shuffle for unbiased random selection when `SAMPLE` is specified, and otherwise takes the first `LIMIT` cases. This allows for both deterministic (first N) and randomized (sample N) test runs.
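A sketch of that selection logic under assumed helper and variable names; only the environment variable names and the sample-vs-first-N behavior come from the PR description:

```ts
// Unbiased Fisher-Yates shuffle over a copy, then take the first n items.
function fisherYatesSample<T>(items: T[], n: number): T[] {
  const copy = [...items];
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy.slice(0, n);
}

// EVAL_WEBVOYAGER_SAMPLE set => randomized sample; otherwise first LIMIT cases.
function selectCases<T>(cases: T[]): T[] {
  const sample = process.env.EVAL_WEBVOYAGER_SAMPLE;
  const limit = Number(process.env.EVAL_WEBVOYAGER_LIMIT ?? 25);
  return sample
    ? fisherYatesSample(cases, Number(sample))
    : cases.slice(0, limit);
}
```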
## Test Plan